In [1]:
import pandas as pd
import numpy as np
import networkx as nx
There's a whole bunch of Hubway data available over at the Hubway Data Challenge website.
There have already been a bunch of visualizations submitted, especially since the submission date was way back in 2012 (I think).
I decided I'd try my hand, especially since sigma.js seems like such a useful visualization tool.
I also wanted to see if a community detection algorithm on the graph would discover anything interesting (or otherwise obvious) about the commuting patterns of Hubway users.
The pipeline looks like this:
Okay, here we go:
First, import all the station and trip data from Hubway.
Here's what the two tables look like after pulling them into pandas:
In [2]:
stations = pd.read_csv('../data/stations_10_12_to_11_13.csv',
                       index_col=0)
trips = pd.read_csv('../data/hubwaydata_10_12_to_11_13.csv',
                    index_col=0,
                    parse_dates=['start_date', 'end_date'])
In [125]:
# sigma.js requires the node IDs to be strings.
stations = stations.set_index(stations.index.map(str))
trips.start_station = trips.start_station.map(str)
trips.end_station = trips.end_station.map(str)
In [128]:
stations.head(2)
Out[128]:
In [127]:
trips.head(2)
Out[127]:
Next, we've gotta create the graph.
To do this, I iterate through all the rows of the stations dataframe, adding nodes to the graph as I go.
I weight an edge between two stations $n_1$ and $n_2$ by the number of trips taken from $n_1$ to $n_2$, $t_{1,2}$.
(Now that I'm writing this, I should probably normalize these weights by the total number of trips leaving each station. I'll update the notebook later.)
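The normalization mentioned above would amount to $w_{1,2} = t_{1,2} / \sum_k t_{1,k}$, so that the weights leaving any one station sum to 1. Here's a minimal sketch of that arithmetic on a plain dict of trip counts. The function name `normalize_by_departures` and the toy counts are made up for illustration; this is not code from the notebook.

```python
from collections import defaultdict


def normalize_by_departures(trip_counts):
    """Divide each (start, end) count by the total trips leaving `start`."""
    departures = defaultdict(int)
    for (start, _end), count in trip_counts.items():
        departures[start] += count
    return {
        (start, end): count / departures[start]
        for (start, end), count in trip_counts.items()
    }


# Toy counts (hypothetical): 3 trips A->B, 1 trip A->C, 2 trips B->A.
toy_counts = {("A", "B"): 3, ("A", "C"): 1, ("B", "A"): 2}
normalized = normalize_by_departures(toy_counts)
print(normalized[("A", "B")])  # 3 / (3 + 1) = 0.75
```

With weights like these, a busy station and a quiet one are compared by where their riders go, not by how many riders they have.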
In [129]:
DG = nx.DiGraph()
# add station nodes
# sigma.js requires a string ID, an X, and a Y for each node.
for n, r in stations.iterrows():
    DG.add_node(n,
                label=r.station,
                y=r.lat,
                x=r.lng)
# add edges for trips.
# There's an edge between two stations if there was a trip between them.
# Edges are weighted by the number of trips
for n, r in trips.iterrows():
    if DG.has_edge(r.start_station, r.end_station):
        DG[r.start_station][r.end_station]['weight'] += 1
    else:
        DG.add_edge(r.start_station, r.end_station, weight=1)
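Looping over every trip with `iterrows` can be slow on a large table. As an alternative sketch (not what the cell above does), the same trip counts can be computed with a pandas `groupby` and handed to networkx in one call. The `toy_trips` frame and `toy_graph` below are hypothetical stand-ins; only the column names `start_station` and `end_station` come from the notebook.

```python
import networkx as nx
import pandas as pd

# Toy stand-in for the trips dataframe (values are made up).
toy_trips = pd.DataFrame({
    "start_station": ["3", "3", "5", "3"],
    "end_station": ["5", "5", "3", "7"],
})

# Count trips per (start, end) pair in one pass.
counts = (toy_trips
          .groupby(["start_station", "end_station"])
          .size()
          .reset_index(name="weight"))

# Build the directed graph from (start, end, weight) triples.
# int() turns the numpy integers from pandas into plain Python ints.
toy_graph = nx.DiGraph()
toy_graph.add_weighted_edges_from(
    (start, end, int(weight))
    for start, end, weight in counts.itertuples(index=False, name=None))

print(toy_graph["3"]["5"]["weight"])
```

Stations with no trips would still need to be added as nodes separately, as the station loop above does.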
Here's what the gross networkx visualization looks like (with the Fruchterman-Reingold layout):
In [140]:
pos = nx.fruchterman_reingold_layout(DG)
nx.draw_networkx(DG, pos)
There's a nice little Python module called community (from the python-louvain package) that implements the Louvain community detection method.
I'm pretty ignorant of best practices here, so I just ran it a few times until I saw a layout in the induced community graph (image #2).
In [131]:
import community
import matplotlib.pyplot as plt

# best_partition only accepts undirected graphs. Note that nx.Graph(DG) keeps
# a single direction's weight for reciprocal edges rather than summing the two.
DG = nx.Graph(DG)
partition = community.best_partition(DG)

# draw each community in its own shade of gray
size = float(len(set(partition.values())))
pos = nx.spring_layout(DG)
count = 0
for com in set(partition.values()):
    count = count + 1.
    list_nodes = [nodes for nodes in partition.keys()
                  if partition[nodes] == com]
    nx.draw_networkx_nodes(DG, pos, list_nodes, node_size=20,
                           node_color=str(count / size))
nx.draw_networkx_edges(DG, pos, alpha=0.8)
plt.show()

# let's see if it makes sense:
ind_graph = community.induced_graph(partition, DG)
#pos = nx.spring_layout(ind_graph)
pos = nx.fruchterman_reingold_layout(ind_graph)
nx.draw_networkx(ind_graph, pos)
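Louvain is randomized, which is why rerunning it gives different partitions. One less ad hoc way to "run it a few times" is to try several seeds and keep the partition with the highest modularity. The sketch below swaps in networkx's built-in `louvain_communities` (available in recent networkx releases) in place of the `community` module used above; the helper name `best_of_seeds` and the toy graph are assumptions for illustration.

```python
import networkx as nx


def best_of_seeds(graph, seeds=range(10)):
    """Run Louvain once per seed and keep the highest-modularity result."""
    best_score, best_communities = float("-inf"), None
    for seed in seeds:
        communities = nx.community.louvain_communities(
            graph, weight="weight", seed=seed)
        score = nx.community.modularity(graph, communities, weight="weight")
        if score > best_score:
            best_score, best_communities = score, communities
    return best_score, best_communities


# Toy graph (hypothetical): two heavy triangles joined by one light edge.
toy = nx.Graph()
toy.add_weighted_edges_from([
    ("a", "b", 5), ("b", "c", 5), ("a", "c", 5),
    ("x", "y", 5), ("y", "z", 5), ("x", "z", 5),
    ("c", "x", 1),
])
score, communities = best_of_seeds(toy)
print(score, sorted(sorted(c) for c in communities))
```

Fixing the seeds also makes the result reproducible from one notebook run to the next.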
Great. Now that we've run the community detection, we can add a 'community' attribute to each node and then export the graph to .gexf for use in Gephi.
In [132]:
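# add the communities to the graph
# (networkx 1.x API; on networkx 2.x, use DG.nodes(data=True) in place of
# DG.nodes_iter(data=True) and DG.nodes[n] in place of DG.node[n])
for n, d in DG.nodes_iter(data=True):
    DG.node[n]['community'] = partition[n]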
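A quick way to check whether the communities make sense is to list the station names that ended up in each one. Here's a minimal sketch, assuming a `partition` dict shaped like the one returned above (node ID to community number) and a dict of station labels. The helper name `stations_by_community` and the toy data are made up.

```python
from collections import defaultdict


def stations_by_community(partition, labels):
    """Group station labels by the community their node was assigned to."""
    groups = defaultdict(list)
    for node, com in partition.items():
        # Fall back to the raw node ID if a label is missing.
        groups[com].append(labels.get(node, node))
    return {com: sorted(names) for com, names in groups.items()}


# Toy data (hypothetical station IDs and names).
toy_partition = {"3": 0, "5": 0, "7": 1}
toy_labels = {"3": "Station A", "5": "Station B", "7": "Station C"}

grouped = stations_by_community(toy_partition, toy_labels)
for com, names in sorted(grouped.items()):
    print(com, names)
```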
In [ ]:
# write_gexf produces GEXF (XML), so the file gets a .gexf extension
nx.write_gexf(DG, '../sigma/trips.gexf')
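Before opening the file in Gephi, it can be worth checking that the node attributes survive the export. The following is a small round-trip sketch on a toy graph with plain Python attribute values, assuming a networkx 2.x or later install; the node IDs, names, and coordinates are made up.

```python
import os
import tempfile

import networkx as nx

# Toy graph (hypothetical values) with the same attribute names used above.
toy = nx.Graph()
toy.add_node("3", label="Station A", x=-71.10, y=42.34, community=0)
toy.add_node("5", label="Station B", x=-71.07, y=42.35, community=1)
toy.add_edge("3", "5", weight=2)

# Write to a temporary .gexf file and read it straight back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.gexf")
    nx.write_gexf(toy, path)
    loaded = nx.read_gexf(path)

print(loaded.nodes["3"]["community"])
print(loaded["3"]["5"]["weight"])
```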
Messing around with the network is beyond the scope of this post/notebook.
In Gephi, I did the following things:
You can see the finished product produced with sigma.js here